Methods for detecting copy-number variations in next-generation sequencing

ABSTRACT

Copy Number Variants (CNV) detection methods may integrate CNV detection into workflow for next generation sequencer (NGS) data processing, in parallel with SNP and INDEL variant calling. CNV detection methods may analyze coverage patterns across a set of genomic regions and across samples from different patients. The methods do not require specifically chosen reference samples as, but automatically select reference samples from the same batch, for each sample tested. CNV detection methods may detect CNVs in a set of samples without assumptions about CNV status of any of those samples. Embodiments herein may apply the CNV detection scheme iteratively to improve detection performance. The proposed methods may further comprise the step of iteratively feeding back information about the CNVs in the samples into the next iteration step. The methods may also use information from the NGS workflow, such as information on SNP fractions, as input to the NGS CNV detection.

FIELD OF THE INVENTION

Methods described herein relate to genomic analysis in general, and morespecifically to next generation sequencing applications.

BACKGROUND OF THE INVENTION

Next-Generation Sequencing

Next-generation sequencing (NGS) or massively parallel sequencing (MPS)technologies have significantly decreased the cost of DNA sequencing inthe past decade. NGS has broad application in biology and dramaticallychanged the way of research or diagnosis methodologies. For example, RNAexpression profiling or DNA sequencing can only be conducted with a fewnumbers of genes with traditional methods, such as quantitative PCR orSanger sequencing. Even with microarrays, profiling the gene expressionor identifying the mutation at the whole genome level can only beimplemented for organisms whose genome size is relatively small. WithNGS technology, RNA profiling or whole genome sequencing has become aroutine practice now in biological research. On the other hand, due tothe high throughput of NGS, multiplexed methods have been developed notjust to sequence more regions but also to sequence more samples.Compared to the traditional Sanger sequencing technology, NGS enablesthe detection of mutation for much more samples in different genes inparallel. Due to its superiorities over traditional sequencing method,NGS sequencers are now replacing Sanger in routine diagnosis. Inparticular, genomic variations of individuals can now be routinelyanalyzed for a number of medical applications ranging from geneticdisease diagnostic to pharmacogenomics fine-tuning of medication inprecision medicine practice. NGS consists in processing multiplefragmented DNA sequence reads, typically short ones (less than 300nucleotide base pairs). The resulting reads can then be compared to areference genome by means of a number of bioinformatics methods, toidentify small variants such as Single Nucleotide Polymorphisms (SNP)corresponding to a single nucleotide substitution, as well as shortinsertions and deletions (INDEL) of nucleotides in the DNA sequencecompared to its reference.

Targeted Enrichment

In some pathologies, a specific gene variant has been associated withthe illness, such as the BRCA1 and BRCA2 genes in certain forms ofhereditary breast and ovarian cancers or the CFTR gene in cysticfibrosis. Rather than sequencing the whole genome (WGS) from anindividual sample, the genomic analysis can focus on the genome regionassociated with the illness, by targeting, with a set of region-specificDNA primers or probes, and enriching or amplifying, for instance withPCR (Polymerase Chain Reaction), the biological DNA sample specificallyfor sub-regions corresponding to the gene along the DNA strand. A numberof next generation sequencing assays have now been developed along thoseprinciples as ready-to-use biological kits, such as for instance theMultiplicom MASTR or the Illumina TruSeq® Amplicon assay kits tofacilitate DNA based diagnostics with next generation sequencers, suchas for instance the Illumina MiSeq® sequencer, in medical research andclinical practice.

Target enrichment may be achieved from a small sample of DNA by means ofprobe-based hybridization (on arrays or in-solution) or highlymultiplexed PCR-based targeted exon enrichment, so that both the genecoverage/read depth and the amplification specificity (amplifying theright region, as measured by further alignment to the desired targetregions) are maximized. Examples of commercially available targetenrichment systems include Agilent SureSelect™ Target Enrichment System,Roche NimbleGen SeqCap EZ, Illumina Nextera Rapid Capture, AgilentHaloplex™′ and Multiplicom MASTR′.

In order to maximize the use of the massively-parallel processing NGSsequencer, a number of samples are multiplexed in the targeted NGSexperiment—a pool of 48 or more target enrichment samples can thus besimultaneously input to the Illumina MiSeq sequencer for instance. Rawsequencing data out of the NGS sequencer may then be analyzed toidentify specific subsequences, for instance by alignment to a referencegenome. As a result, the amplification may produce more than a thousandreads for a given amplicon in a patient sample.

CNV Detection

In practice, beyond SNPs and INDELs, a number of pathologic geneticvariants are caused by more significant changes in the DNA sequence. Acopy-number variant (“Copy-number value”, “Copy-number variation”, orCNV) quantifies the number of copies of a particular region in thesample DNA sequence, that may be subject to long duplications (number ofcopies above the normal value) or deletions (number of copies below thenormal value) of possibly more than several hundreds of nucleotides whencompared to the reference genome. While next generation sequencingmethods have been shown more efficient than traditional Sangersequencing in the detection of SNPs and INDELs, detection of CNVs intargeted NGS raises a number of specific challenges for alignment to thereference genome or matching to some specific subsequences, as the readlength is typically lower than 300 bp, i.e. a shorter sequence than theoverall CNV regions. State-of-the-art CNV detection methods such as MLPA(Multiplex Ligation-dependent Probe Amplification) still require aseparate experiment and genomic analysis workflow. This limits theadvantages of NGS in practical genomic analysis applications, asdifferent workflows to process different patient samples need to beconducted to detect the CNVs of pathological importance. Also, the stateof the art CNV detection methods are low throughput and cannotsimultaneously check CNVs for a large number of samples and regions inparallel. A number of solutions have thus been recently proposed in theliterature to better address CNV detection with NGS workflows. Oneapproach, as described for instance in WO2014151511, consists incomparing the level of target amplicons to the level of a controlamplicon so as to determine the presence of a CNV. However, this methodis very sensitive to the choice of the control amplicon, which may notbe readily available. Another approach consists in further optimizingthe target enrichment step so as to have a better reference for CNVdetection from the target enrichment sample pool itself. For instance,WO2015112619 discloses the use of dummy primers to assign a unique setof reference nucleotide sequences to each bin of pre-sorted amplicons,at the expense of an extra PCR amplification step and iterativeexhaustive search by the CNV detection module. WO2014083147 proposes tooptimize the PCR primers with a complementary region to the sequence tobe analyzed on the 3′ end and a non-complementary region on the 5′ end.The latter methods require the use of a specific assay kit, which is toolimitative for many current applications.

There is therefore a need for a better solution to efficiently detectCNVs, possibly for a large number of samples and regions simultaneously,within a single targeted Next Generation Sequencing experiment,regardless of which underlying target enrichment technology has beenused, and in as automated a workflow as possible to facilitate theresearch and clinical laboratory practice over the prior art methods.

BRIEF SUMMARY

The foregoing advantages may be achieved by a method for detecting copynumber values (CNV) from a pool of DNA samples enriched with a targetenrichment technology, each enriched DNA sample being associated with alibrary of pooled fragments from a set of amplicons/regions, eachamplicon/region being sequenced with a high-throughput sequencer togenerate coverage count for each sample and for each amplicon/region,comprising: normalizing, with a data processing unit, the coverage countassociated with each sample; selecting, with a data processing unit, foreach sample, a set of reference samples, within the pool of DNA samples,as the samples with the closest normalized coverage count to said samplenormalized coverage count; and for each sample, estimating the copynumber values in said sample as a function of at least the coveragecounts in said sample and of at least the coverage counts in theselected set of reference samples for said sample.

The number of reference samples may be a function of the total number ofsamples. It may be smaller than the total number of samples.

Selecting a set of reference samples may comprise calculating a distancebetween the coverage counts normalized both within each sample/plex andwithin each region. The distance may be the Euclidean distance.

Normalizing the coverage count and/or selecting a set of referencesamples may depend on an estimate of the copy number values for eachsample and for each amplicon/region at a previous iteration. The priorestimate of the copy number values may be pre-defined. The priorestimate of the copy number values may be calculated iteratively,starting with a pre-defined prior estimate of the copy number values andusing the result of the CNV detection at each iteration as the priorestimate of the copy number values in the subsequent iteration, untilthe copy number values estimate converges, reaches a cycle, or thenumber of iterations reaches a pre-defined limit.

For each sample and for each amplicon/region, the likelihood for eachpossible copy number value may be estimated. A Hidden Markov Model maybe further used to estimate the copy number values and their confidencelevels for each amplicon/region. Possible copy number values for whichthe confidence level is below a minimum threshold may be filtered outfrom the results.

The estimate of the copy number values may be calculated usinginformation on the SNP fractions and coverage count.

A principal-component filter may be applied to the coverage count.

The number of reference samples may depend on the iteration index. Inone iteration, the number of reference samples N_(R) may equal the totalnumber of samples N, while at another iteration the number of referencesamples N_(R) may be different from the total number of samples N.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 represents a targeted NGS genomic analysis functional workflow.

FIG. 2 schematically shows the structure of a coverage information tableas input to the genomic data analyzer CNV detection module according apossible embodiment of the disclosure.

FIG. 3A illustrates exemplary amplicon-based coverage information asretrieved and visualized with the IGV software from differentstate-of-the art targeted NGS platforms.

FIG. 3B illustrates exemplary probe-based sample coverage information asretrieved and visualized with the IGV software from differentstate-of-the art targeted NGS platforms.

FIG. 4 shows a flowchart of a genomic data analyzer CNV detection moduleaccording to a possible embodiment of the disclosure.

DETAILED DESCRIPTION

Genomic Analysis System

FIG. 1 shows an exemplary genomic analysis system in accordance with apossible embodiment of the disclosure, comprising a targeted enrichmentassay 100, a next generation sequencer 110 and a genomic data analyzer120.

A pool of DNA samples is processed by the targeted enrichment assay 100to generate a library of DNA fragments prepared by amplicon-basedenrichment or probe-based enrichment as input to the next generationsequencer 110, each set of fragments corresponding to a different DNAsample. The number of fragments is application dependent. For example,in some amplicon-based experiments, target enrichment may require 150primers to enrich 75 different regions to be targeted out of the samplegenome. In other probe-based experiments, probe enrichment may select,for example, DNA fragments from 413 selected regions. The number ofsamples may also be adapted to the next-generation sequencing sequencer110 parallel processing capability, for instance 48 samples may besequenced in parallel by an Illumina MiSeq sequencer. Other NGSsequencer technologies may be used, such as for instance the Roche 454™GS Junior or GS FLX, Illumina MiSeq®, or Life Technologies Ion PGM™sequencers.

The next-generation sequencer 110 analyses the input samples andgenerates sequence reads in a computer-readable file format representingraw NGS sequencing data. Depending on the NGS technology, one or morefiles may be output by the NGS sequencer 110. In some embodiments, theFASTQ file format may be used with two different files for forward andreverse reads or as a single joined file. Other embodiments are alsopossible. The raw NGS sequencing data is further input to the genomicdata analyzer 120.

The genomic data analyzer 120 computer system (also “system” herein) 120is programmed or otherwise configured to implement different genomicdata analysis methods, such as receiving and/or combining sequencingdata and/or annotating sequencing data.

The genomic data analyzer 120 may be a computer system or part of acomputer system including a central processing unit (CPU, “processor” or“computer processor” herein), memory such as RAM and storage units suchas a hard disk, and communication interfaces to communicate with othercomputer systems through a communication network, for instance theinternet or a local network. In some embodiments, the computer systemmay comprise one or more computer servers, which may enable distributedcomputing, such as cloud computing, for instance in a genomic data farm.In some embodiments, the genomic data analyzer 120 may be integratedinto a massively parallel system. In some embodiments, the genomic dataanalyzer 120 may be directly integrated into a next generationsequencing system.

As illustrated on FIG. 1, the genomic data analyzer 120 may comprise analignment module 121, which compares the raw NGS sequencing data to areference genome. The alignment results (which may be represented as oneor several files in BAM, SAM or other similar formats, as known to thoseskilled in the bioinformatics art) may be further analyzed in search forSNP and INDEL polymorphisms by means of a SNP/INDEL detection module124. Alignment information may be further filtered and analyzed toretrieve coverage information (or coverage count). In an embodiment ofthe present disclosure, a coverage extraction module 122 may process thealignment data to extract coverage information in accordance with thetargeted enrichment 100 and NGS sequencer 110 technologies appliedupstream in the overall genomic analysis workflow. A CNV detectionmodule 123 in accordance with the disclosure may then analyze thecoverage information to identify and qualify copy-number variants (CNVs)in the original DNA samples. In some embodiments, the CNV detectionmodule 123 may operate iteratively, by using CNV detection informationfrom a former step in a next iteration. As can be seen in FIG. 1, thesame NGS experiment with a single target enrichment step 100 and asingle sequencing step 110 can thus be used to analyze differentSNP/INDELs and CNV genomic variants simultaneously, instead of runningseparate NGS/SNP-INDELS detection and MLPA/CNV detection experiments asin the prior art genomic data analysis workflows.

CNV Detection—Overall Scheme

FIG. 2 schematically shows a coverage information table out of thecoverage, where the rows represent amplicons (or suitably definedregions, in the case of a probe-based technology) and columns representsamples. The symbols ‘*’ in the table represent the coverage information(coverage count) for each amplicon/region in each sample. This coveragecount is defined as a suitable function of the numbers of forward reads,reverse reads, and read pairs (in case of pair-ended sequencing)corresponding to a given amplicon or region in a given sample. Thedefinition of “correspondence” to an amplicon may be based on the matchbetween the beginning and/or the end of the read to the beginning and/orend of the amplicon, and the “correspondence” to a region may be basedon the overlap of the read with the region. However other suitabledefinitions may also be used in different embodiments. Additionalfilters based on the read parameters (for example, read length ormapping quality) may also be used. In one of the embodiments, thecoverage is defined as the sum of the number of forward reads and thenumber of reverse reads, while in another embodiment the coverage isdefined as the number of read pairs. In yet other embodiments otherfunctions may be used (for example, using only forward-read or onlybackward-read counts, or the maximal of the two counts). Furthermore,depending on the properties of the target-enrichment assay, differentfunctions may be used for different amplicons/regions. FIG. 3illustrates possible choices of the coverage definition inamplicon-based and in probe-based designs, as viewed in IGV softwareusing aligned reads. FIG. 3a ) shows the amplicon coverage as the numberof read pairs aligned to the corresponding region. FIG. 3b ) shows theregion coverage as the total number of reads (both forward and reverse)overlapping with a suitably chosen target region.

In the case where the CNV corresponds to a whole amplicon or regionbeing duplicated, the NGS coverage corresponding to this region will beunusually high. Conversely, when the CNV corresponds to a whole ampliconor region being deleted, the NGS coverage corresponding to this regionwill be unusually low. It is therefore possible to detect CNVs byanalyzing the coverage information distribution in the coverageinformation table across the amplicons/regions and across the samples.The coverage count I_(a,s) may be approximately factorized intosample/plex-dependent and amplicon/region-dependent contributions:

I _(a,s) =R _(a,s) F _(a) P _(x,s) +δI _(a,s)  (Eq. 1)

where:

I_(a,s) is the coverage count in the sample s for the amplicon (orregion) a.

F_(a) is the amplification factor specific for the amplicon (or region)a.

P_(x,s) is the factor representing the amount of the DNA materialprocessed for the sample s in the plex tube x (if applicable, forinstance in the case of an amplicon-based technology with several plextubes, where x specifies the plex for the amplicon a).

δI_(a,s) is the coverage noise, which may be assumed to be smallcompared to the total coverage I_(a,s). This overall coverage noise mayresult from various stages of the laboratory procedure (DNA extractionand targeted enrichment) and as well as from the sequencing technologyitself. In a preferred simple model, the coverage noise may be modeledas δI_(a,s)=ε⁽¹⁾ _(a,s)I_(a,s)+ε⁽²⁾ _(a,s)√I_(a,s), where the first termrepresents the intensive part of the noise (proportional to the coveragecount) and the second term represents the Poisson noise arising fromrandom fluctuations of a finite number of sequencing reads (thiscontribution is proportional to the square root of the coverage count).

R_(a,s) is the copy number value (the multiplicity of a givenamplicon/region in a given sample) to be deduced by the proposed CNVdetection method. In most cases (except for sex chromosomes andhomologous regions), R_(a,s)=2 is the normal CNV value, and deviationsfrom this normal value may indicate the presence of a CNV (for example,R_(a,s)=1 and R_(a,s)=3 may correspond to a heterozygous deletion and toa heterozygous duplication, respectively). Note that in germlinesamples, all cells are expected to carry the original individual genomeDNA, so the copy number is an integer.

Since, even in the best laboratory conditions, different samples mayhave slightly different amplification factors F_(a), CNV detection maybe more reliable by comparing each sample to a selected group of“reference samples”, chosen as the samples having the best correlationswith the sample being tested, rather than comparing all samplestogether. In the case of several plexes, the reference samples may bedifferent for different plexes.

The method assumes that the number of samples in one batch issufficiently large (in a possible embodiment, the number of samples isat least eight, but other choices are also possible) and that the CNVsare sufficiently rare, so that for any amplicon/region the majority ofsamples have the regular copy number.

As represented on FIG. 4, the proposed method thus comprises the stepsof:

Normalizing the coverage information by sample s (and plex x ifapplicable), taking into account an estimate of the copy number valuesdetected in the previous iteration (if any) (400). In the firstiteration, the regular (usual) copy number value may be used as anassumption, for instance 2.

For each sample s (referred to as “current sample” below):

-   -   a. Selecting a set of reference samples as the samples with        normalized coverage having the closest coverage pattern to the        normalized coverage of the current sample s (401);    -   b. Normalizing by amplicon/region a using the normalized        coverage of the current sample and the normalized coverage from        its selection of reference samples (402);    -   c. For each amplicon/region a in the current sample,        -   i. Estimating the reference level and noise (410);        -   ii. For each amplicon/region a, estimating the likelihood            for each possible copy-number value (CNV state) (420);        -   iii. Based on the estimated likelihoods of copy-number            values, identifying the actual copy-number values (CNVs) and            their confidence levels (421).

Iterating from step (400), by taking into account the CNVs alreadydetected in the previous iteration, until the detected CNVs stabilize orreach a cycle, or a maximum number of iterations is reached (430);

Filtering the samples by the residual noise and by CNVs found.Optionally, if necessary, repeat the whole procedure from the beginningwith some samples excluded (440).

The individual steps will now be detailed as follows.

Sample- and Plex-Wise Normalization

Due to the specifics of the target enrichment experimental process, theraw coverage information out of the NGS workflow is not normalized. Forinstance there may be a different amount of DNA in each sample/plex,resulting in different coverage information values in the raw resultsalong the sample/plex axis.

In a possible embodiment, in order to remove the sample/plex bias fromthe raw data set, the sample/plex-wise normalization step 400 may betaken as follows: the average over the sample/plex is determined as themean of all the coverage counts I_(a,s) normalized to a single copy,using the copy-number values R_(a,s) calculated in the course of aprevious iteration. The initial copy-number value for all amplicons ofall samples/plexes may be set to the normal value (typically R_(a,s)=2,except in case of sex chromosomes and homologous regions). The coveragedata may then be divided by the calculated mean, separately for eachsample/plex, so that the resulting sample/plex-normalized coverage is ofthe order one:

I ^((norm)) _(a,s) =I _(a,s)/mean(I _(a′,s) /R _(a′,s)|allamplicons/regions a′ in plex x).  (Eq. 2)

In a possible embodiment, the regions with presumed homozygous deletionsR_(a,s)=0 may be excluded from the mean calculation. In a possibleembodiment, the mean is computed as the arithmetic mean. In anotherpossible embodiment, the mean is computed as the geometric mean. Otherembodiments are also possible. As will be apparent to those skilled inthe art, different embodiments of the normalization method may also beused in different iterations of the CNV detection method.

Automated Selection of Reference Samples

The proposed method allows the automatic selection of reference samplesfrom the normalized coverage information out of the target NGSexperiment data, without requiring the user input to provide or manuallyselect dedicated control samples, as will now be described in moredetail.

In general, suitable reference samples may be automatically selected(step 401) by the CNV detection module 123 for each sample/plex as thosehaving the closest coverage pattern to the current sample s₀. In apossible embodiment, the closest coverage pattern may be selected bycalculating for every sample a distance from the current sample s₀, thensorting samples in order of increasing distance, and choosing a certainnumber of samples from the top of the list (having the smallestdistances).

As will be apparent to those skilled in the art of statistics, there aremany possible ways to define and calculate the distance between samples.In a possible embodiment, we first use the sample/plex normalized countsI^((normal)) _(as) to compute thesample/plex/amplicon/copy-number-normalized count as a vector V_(a,s):

V _(a,s)=(I ^((normal)) _(a,s) /R _(a,s))/median(I ^((normal)) _(a,s′)/R _(a,s′)|all samples s′).  (Eq. 3)

In a possible embodiment, the regions with presumed homozygous deletionsR_(a,s)=0 may be excluded from the median calculation.

In a possible embodiment, the distance between any sample s and thecurrent sample s₀ may be defined as the Euclidean distance between thevectors V_(a,s) and V_(a,s0). In a possible embodiment, the correlationbetween the vectors V_(a,s) and V_(a,s0) may be computed.

In the case of amplicon-based technology with several plex tubes, thedistances are preferably calculated separately for each plex x, possiblyleading to different sets of reference samples for different plexes. Forexample, in the case of two plexes (e.g., for CFTR MASTR Multiplicomtargeted enrichment kit), there may be two reference sets for eachsample: one for each plex.

Other ways to define and calculate distance between samples are alsopossible. For example, one may use arithmetic or geometric mean in placeof median or vise versa, or one may use other types of metric in placeof Euclidean metric. The algorithm may also exclude certain regions orattribute different weights to different regions, depending on differentcriteria. In some embodiments, clustering algorithms may be used forassigning the distance. Other types of algorithms are also possible.

After calculating the distances between each sample and the currentsample s₀, the reference samples may be chosen as a certain number ofsamples with the smallest distances. As will be understood by thoseskilled in the art, the number of reference samples should be chosencarefully. This number shall be sufficiently large for good statisticalrelevance; in particular, at each amplicon position, the set ofreference samples should have the majority of normal copy numbers (nomutations). On the other hand, this number shall be sufficiently smallso that only similar samples from the run are compared and outliers arefiltered out.

In a possible embodiment, the number of reference samples N_(R) may beselected as a function of the total number of samples N. In a possibleembodiment, this function may be defined as:

N _(R)=[αN+β√N+γ]  (Eq. 4)

with suitably chosen coefficients α, β, and γ, and [ . . . ] denotingthe integer part. In a possible embodiment, parameters α, β, and γ maybe chosen so that N_(R) 10.25*M+2, to select approximately 25% ofsamples as reference samples. Other choices of coefficients and, moregenerally, functions are also possible. In another embodiment, N_(R) maydepend not only on the total number of samples N, but also on otherproperties of the data, for example, on the level of fluctuations of thecoverage count. Furthermore, in some embodiments, the number ofreference samples may be different for different current samples so (orfor different plexes), for example, if the reference samples areselected as those at a distance below a certain cutoff distance from thecurrent sample s₀. Other choices of algorithms for selecting referencesamples based on the calculated distance are also possible.

While generally it may be beneficial to keep the number of referencesamples N_(R) smaller than the total number of samples N, in order toexclude noisy samples from references, in a possible embodiment, thenumber of reference samples N_(R) may be equal to the total number ofsamples N.

In some embodiments, the choice of specific embodiments of calculatingdistances and/or of selecting reference samples may vary from iterationto iteration. For example, in a possible embodiment, in one of theiterations of the CNV detection (in one possible embodiment,specifically in the second iteration), the number of reference samplesN_(R) is taken to be equal to N: as a consequence, for this oneiteration all the samples are used as reference samples. As will beapparent to those skilled in the art of bioinformatics, this flexiblemethod may improve the detection performance in the case of frequentCNVs.

Amplicon-Wise Normalization

In addition to the plex/sample bias, there may also be coverageinformation divergences between different amplicons/regions, as theamplification efficiency tends to be region-dependent, thus alsoresulting into different coverage information values along theamplicons/region axis. Thus the plex-normalized data may be additionallynormalized amplicon-wise in step 402 by the CNV detection module 123,for instance by dividing the coverage information by the median for eachamplicon a, again using the coverage levels normalized to the assumednormal value of the copy number from the previous iteration (or thenormal copy number in the first iteration). At this step, thenormalization may be performed specifically within a reduced set ofsamples, including the reference samples as selected in step 401:

c _(a,s) =I ^((norm)) _(a,s)/median(I ^((norm)) _(a,s′) /R _(a,s′)reference samples s′)  (Eq. 5)

In one embodiment, the current sample s₀ may be included in the mediancalculation in addition to all reference samples. In another possibleembodiment the current sample s₀ may be excluded from the mediancalculation. In a possible embodiment, the regions with presumedhomozygous deletions R_(a,s)=0 may be excluded from the mediancalculation.

Estimating Reference Level and Noise for Each Amplicon or Region

At this stage of the proposed method, the following data have beencomputed for each sample s:

the set of reference samples (for every plex x, if applicable);

the sample/plex/amplicon normalized coverage levels c_(a,s).

A further step 410 of estimating the reference coverage level and noise(uncertainty) may be further applied as follows.

First, the coverage levels c_(a,s) may be converted into normalizedcoverage levels per copy by using the assumed copy-number value R_(a,s)calculated in the previous iteration:

c ⁽⁰⁾ _(a,s) =c _(a,s) /R _(a,s)  (Eq. 6)

In the very first iteration, the copy number may be assumed to be normalfor all samples and for all amplicons/regions, for instance R_(a,s)=2.

Second, the reference normalized coverage level C_(a) for eachamplicon/region a may be estimated. In a possible embodiment, thenormalized coverage level C_(a) may be assumed equal to one. In anotherpossible embodiment, the coverage level C_(a) may be calculated as themean of the normalized values c⁽⁰⁾ _(a,s) taken over the referencesamples, with the outliers (values deviating from the mean more than acertain threshold, for instance three standard deviations) removed.Other choices of estimating the reference coverage levels are alsopossible, as will be understood by those skilled in the art ofstatistics.

Third, the noise level for each amplicon/region may be estimated. Thenoise level may be defined as the expected relative (divided by themean) root-mean-square uncertainty of the coverage. In a possibleembodiment, the noise level σ_(a,s) may be estimated as

σ_(a,s)=max(σ_(s),σ_(a),1√I _(a,s)),  (Eq. 7)

where:

σ_(s) and σ_(a) are the standard deviations of c⁽⁰⁾ _(a,s) for the givensample and for the amplicon (within the set comprising the given sampleand the reference samples).

1/√I_(a,s) is the relative root-mean square deviation for the Poissonnoise corresponding to the original coverage value I_(a,s).

By using the standard deviations across the samples and across theamplicons, Eq. 7 takes into account the possibility of both noisysamples and noisy amplicons. For calculating σ_(a), it may be beneficialto exclude outliers. In a possible embodiment, the data points outsidethe 3σ interval may be excluded. As will be apparent to those skilled inthe art of statistics, other ways to estimate the noise level σ_(a,s)may also be used, in place of Eq. 7.

In some embodiments, the step of estimating the reference normalizedcoverage level C_(a) for each amplicon/region a may be modified in oneof the first iterations (for example, in one possible embodiment,specifically in the first iteration), in order to detect CNVs that spana large fraction of all the amplicons/regions. In such a case, a simplenormalization by sample may not enable to estimate a correct referencelevel. This problem may be solved using an additional algorithm thatdetermines the reference level either on the basis of a special set of“control” amplicons/regions (that are assumed to be CNV free in theirmajority) or on the basis of the best match of the normalized levels tointegers.

Calculating Likelihoods for Different Copy Numbers

At this step, the normalized coverage levels and the noise levels foreach amplicon/region a and for each sample s may be further convertedinto log-likelihoods L_(a) (defined as the minus logarithm of thelikelihood of a particular coverage level, under the assumption of agiven copy number and assuming a given noise level) (step 420). In apossible embodiment, the model of a Gaussian noise may be used forcomputing the log-likelihoods L_(a):

L _(a)(r)=min((c _(a,s)/(rC _(a))−1)²/(2σ_(a,s) ²),L _(max)),  (Eq. 8)

where c_(a,s), C_(a), and σ_(a,s) are respectively the coverage, thereference normalized coverage level, and the noise level for the currentsample s and amplicon/region a. The log-likelihoods L_(a)(r) may thus becalculated for all integer values of r ranging from 0 to a certainmaximal value (in one embodiment, we have chosen the maximal value of rto equal 6, but other embodiments are also possible). In case r=0, Eq. 8may be replaced by L_(a)(0)=min((c_(a,s)/C_(a))²/(2 σ₀ ²), L_(max)),where σ₀ is the assumed noise level for a full deletion (in oneembodiment, we chose σ₀=0.01, but other choices are also possible). Thelog-likelihood may be capped by a certain value L_(max) in order to takeinto account that large fluctuations do not obey the normaldistribution.

In some embodiments, L_(max) may depend on the amplicon/region a. Inother embodiments, other noise models may be used, in place of Eq. 8.

Finding CNVs and their Confidence Levels

Using the log-likelihoods calculated in the previous step, the mostlikely CNV states may be found and their confidence levels may becalculated. In a possible embodiment, the Hidden-Markov-model (HMM)method may be used for this purpose (step 421). Examples of the use ofHMM in CNV detection can be found for instance in S. Ivakhno et al.,“CNAseg—a novel framework for identification of copy number changes incancer from second generation sequencing data”, Bioinformatics (2010)26(24):3051-3058. Other embodiments are also possible, for instance asimple comparison of L_(a)(r) with a suitably chosen threshold,similarly to the MLPA recommended procedure, as known to those skilledin the art.

In a possible embodiment, the HMM method may be realized by defining theHMM score as:

S _(HMM)({r _(a)})=Σ_(a)(L _(a)(r _(a))+p _(nb)(r _(a))+p _(sw)(r _(a),r _(a+1)),  (Eq. 9)

where the HMM score S_(HMM)({r_(a)}) is a function of the set of theassumed copy numbers r_(a) for every amplicon/region in the currentsample, L_(a)(r_(a)) are the log-likelihoods calculated at the previousstep, and p_(nb)(r_(a)) and p_(sw)(r_(a),r_(a+1)) are additionalpenalties associated with the non-normal copy number and with atransition different copy numbers between neighboring amplicons/regions(denoted as a and a+1). The parameters p_(nb)(r_(a)) andp_(sw)(r_(a),r_(a+1)) may be chosen to provide a good performance andreflect the Bayesian prior expectations of having CNVs in the sample. Ina possible embodiment, the functions p_(nb)(r_(a)) andp_(sw)(r_(a),r_(a+1)) may be chosen to be independent from theregion/amplicon a, but in other embodiments they may themselves dependon the region/amplicon a. For example, in other possible embodiments,the functions p_(nb)(r_(a)) and p_(sw)(r_(a),r_(a+1)) may be functionsof the length of the region, of the gaps between regions, of thepossible overlaps between amplicons, or of the known collectedstatistics of CNVs in a given region. Other embodiments are alsopossible.

Once the HMM score is defined by Eq. 9, the forward-backward algorithmmay be used, as known to those skilled in the art, to find the set ofCNV states {r_(a)} which minimize the HMM score and the set of“confidence” values. The confidence value at a position a may be definedas the minimal possible increase of the HMM score with the state r_(a)differing from its optimal value. The statistical meaning of thisconfidence is the negative logarithm of the probability of error indetermining r_(a).

In practice, two versions of confidence may be introduced:

“numerical” confidence for determining the exact number of copies;

“variant” confidence for classifying the state asnormal/insertion/deletion without specifying the exact value for thecopy number (in case of insertions or deletions). The “variant”confidence is thus always higher or equal than the “numerical”confidence.

All parameters may be further optimized for better performance and/orthe HMM model may be adapted in various ways as will be apparent to oneskilled in the art. Other models for HMM may also be applied. Forinstance, it may be worth using different penalties for insertions anddeletions and introduce a dependence of the switching penalty on thedifference between the CNV states r_(a) and r_(a+)1, based on statisticsfor all sorts of possible CNVs.

Main Iteration

As apparent from the steps described above, the noise andreference-value calculations depend on the presumed copy-number values(the so-called “prior estimate of the copy-number values”), with thenormal value, for instance 2, being the starting point. In order todetermine the CNV states self-consistently, the algorithm may beiterated several times (step 430), using the result of the CNV detectionat each iteration as the prior estimate of the copy-number values in thesubsequent iteration. Practical experiments showed that, in case of highquality data, a few iterations are sufficient to efficiently detect thereal CNV values. Sometimes (in case of noisy data), the iteration mayenter a periodic cycle. In that case, the algorithm may be stopped assoon as a cycle pattern is detected.

Some of the targeted enrichment technologies may include so called“control amplicons/regions”: amplicons or regions outside the regions ofinterest, typically broadly distributed across the genome. Such controlamplicons may be used to normalize the coverage information. In oneembodiment, only the coverage information from control amplicons is usedin the sample/plex normalization (step 400) in the first iterationspecifically. This may allow more robust detection of possible largeCNVs (e.g., deletions of the whole gene) as in such a case, at the firstiteration the copy-number value R_(a,s) for the large CNV region will beset to the correct value and will preserve this value throughoutsubsequent iterations. Otherwise, the control amplicons may be used onthe same footing as the test amplicons in the calculations of the noiseand reference values, but they do not need to be included in the HMMpart of the algorithm.

Final Filtering

After the last iteration, the proposed method provides the resultingvalues of the CNV levels as well as their confidence level. In apossible embodiment, a minimal threshold may be set for the confidencevalue, below which the results for individual amplicons may be assumedto be “unreliable” and may be filtered out from the results (step 440).In other embodiments, certain samples may also be excluded as“unreliable” based on the residual sample noise σ_(s) or on the residualnoise in one of the plexes or on the unrealistically large number ofCNVs detected. The precise conditions of the labeling the sample as“unreliable” may depend on the details of the target-enrichment andsequencing technologies.

The CNV results for “unreliable” samples may be discarded from the finalresults in step 440. Finally, if too many samples (in one embodimentmore than half of all the samples) are filtered out as “unreliable”, thewhole procedure may need to be repeated from the beginning, with the“unreliable” samples excluded. This option may provide betterperformance for runs where a large fraction of samples had technicalproblems at the initial target-enrichment or sequencing steps in theoverall analysis workflow.

Optimization (Improvements)

In some embodiments, further improvements of the algorithm may beapplied. One possible further embodiment may apply principal-componentfiltering, similar to M. Fromer et al., “Discovery and statisticalgenotyping of copy-number variation from whole-exome sequencing depth”,Am. J. Hum. Genet. (2012) 91:597-607. A principal-component filter maybe applied to the original dataset, for instance before the main CNVdetection algorithm steps 400 to 430, or on the preliminary normalizeddataset, for instance after normalization by sample/plex (step 401) orafter normalization by amplicon/region (step 402). In one embodiment,the filter is trained once on a specially chosen training dataset andsubsequently used without further updates. In other embodiments, thefilter may include learning from new datasets.

In yet another embodiment, as represented on FIG. 1, the outcome from aparallel SNP/INDELs detection module 124 may also be used as an input tothe CNV detection module 123 to further strengthen the CNV detection. Inthis case, the information on the coverage fraction for heterozygousSNPs may be used to bias the decision on the CNV values. For example, a33% SNP fraction may be a strong argument in favor of a duplication(copy number equal 3). In a possible embodiment, this bias may beintroduced at the HMM step 421 by adding a suitably chosen contributionto p_(th)(r_(a)) for a region, where one or several heterozygous SNPsare found.

In a further possible embodiment, the proposed CNV detection method maybe adapted to the case of homologous (identical or nearly identical)regions or pseudogenes. In this case, the normal copy-number value maybe different from 2 (e.g., in the case of one pair of homologousregions, the normal copy-number value equals 4). The CNV detectionalgorithm may be generalized to apply to this case by adjusting thenormal copy number assumed value (e.g. from 2 to 4) and by using thetotal number of reads in all the regions homologous to the consideredone in the main CNV algorithm steps 400 to 430. Additionally, thecoverage differences between homologous regions may be used in a waysimilar to heterozygous SNPs in the former embodiment description tobias the parameters p_(nb)(r_(a)).

Yet another situation where the normal copy number may differ from 2 isthe case of sex chromosomes (X and Y chromosomes). In a possibleembodiment, the normal copy number for regions in X and Y chromosomes isadjusted depending on the sex of the patient. In a further possibleembodiment, the sex of the patient may be determined automatically bycomparing the coverage information between X chromosome, Y chromosome,and autosomes, depending on their presence in the target amplificationtechnology.

Experimental Results

The efficiency of the proposed method as depicted by the FIG. 4flowchart to detect CNV variants has been compared to the MLPA method onan experiment comprising 474 samples in 11 batches originating from onelaboratory, with the BRCA TrusSeq technology and MiSeq next generationsequencing pipeline. One known feature of the BRCA TruSeq assay is thatit has many short amplicons, but the coverage noise is relatively high,and it does not have control amplicons. A deletion of a whole gene(BRCA1 or BRCA2) may nevertheless be detected from comparing thecoverage levels between the genes in accordance with the proposediterative method.

These samples contained 16 CNVs confirmed by MLPA. The same samples wereanalyzed independently using our CNV module. The results were thencompared to the MLPA-confirmed variants. The detection sensitivity,measured as the percentage of the CNVs captured by the algorithm incomparison to the MLPA method, was measured at 100%, i.e. all 16 CNVswere successfully detected. Moreover, the percentage of samples withrejections or false positives recommended for re-testing for CNVs byeither the same or an alternative method (the lower the better) wasmeasured at 4.2% while a maximum value of 10% may be acceptable in thecase of the best laboratory practice. The proposed genomic data analysismethod therefore enables to reach similar CNV detection sensitivity andaccuracy as the state of the art MLPA method, while enabling the use ofa single target NGS experiment pipeline, which brings significantpractical advantages in research or clinical practice.

Other Embodiments and Applications

Although the detailed description above contains many specific details,these should not be construed as limiting the scope of the embodimentsbut as merely providing illustrations of some of several embodiments.

While various embodiments have been described above, it should beunderstood that they have been presented by way of example and notlimitation. It will be apparent to persons skilled in the relevantart(s) that various changes in form and detail can be made thereinwithout departing from the spirit and scope. In fact, after reading theabove description, it will be apparent to one skilled in the relevantart(s) how to implement alternative embodiments.

In addition, it should be understood that any figures which highlightthe functionality and advantages are presented for example purposesonly. The disclosed methods are sufficiently flexible and configurablesuch that they may be utilized in ways other than that shown.

Although the term “at least one” may often be used in the specification,claims and drawings, the terms “a”, “an”, “the”, “said”, etc. alsosignify “at least one” or “the at least one” in the specification,claims and drawings.

Finally, it is the applicant's intent that only claims that include theexpress language “means for” or “step for” be interpreted under 35U.S.C. 112, paragraph 6. Claims that do not expressly include the phrase“means for” or “step for” are not to be interpreted under 35 U.S.C. 112,paragraph 6.

What is claimed:
 1. A method for detecting copy-number values (CNV) froma pool of DNA samples enriched with a target enrichment technology, eachenriched DNA sample being associated with a library of pooled fragmentsfrom a set of amplicons/regions, each amplicon/region being sequencedwith a high-throughput sequencer to generate coverage count for eachsample and for each amplicon/region, comprising: normalizing, with adata processing unit, the coverage count associated with each sample;selecting, with a data processing unit, for each sample, a set ofreference samples as the samples with the closest normalized coveragecount to the normalized coverage count of said sample, the number ofreference samples in each subset of reference samples being a functionof the total number of samples and being smaller than the total numberof samples; for each sample, estimating the copy-number values in saidsample as a function of at least the coverage counts in said sample andof at least the coverage counts in the selected set of reference samplesfor said sample.
 2. The method of claim 1, wherein the number ofreference samples N_(R) in each set of reference samples is given byN_(R)=10.25*M+2, where N is the total number of samples.
 3. The methodof any of the preceding claims, further comprising: for each sample andfor each amplicon/region, estimating the likelihood for each possiblecopy-number value.
 4. The method of claim 3, wherein a Hidden MarkovModel is further used to estimate the copy-number values and theirconfidence levels for each amplicon/region.
 5. The method of claim 4,further comprising: excluding possible copy number values for which theconfidence level is below a minimum threshold.
 6. The method of claim 1,wherein the estimate of the copy-number values is calculated usinginformation on the SNP fractions and coverage.
 7. The method of claim 1,further comprising: applying a principal-component filter to thecoverage count.
 8. The method of claim 1, wherein normalizing thecoverage count associated with each sample depends on a prior estimateof the copy number values for each sample and each amplicon/region. 9.The method of claim 1, wherein selecting a set of reference samplesdepends on a prior estimate of the copy number values for each sampleand each amplicon/region.